Racial and Ethnic Bias in Risk Prediction Models for Colorectal Cancer Recurrence When Race and Ethnicity Are Omitted as Predictors

Key Points
Question: Is omitting race and ethnicity as a predictor in colorectal cancer recurrence risk prediction models associated with racial and ethnic bias?
Findings: In this prognostic study with 4230 patients with colorectal cancer, 4 prediction models for risk of postoperative cancer recurrence were developed and validated. Explicitly considering race and ethnicity as a predictor improved model predictive performance among racial and ethnic minority patients and increased algorithmic fairness in multiple performance measures.
Meaning: These findings suggest that implementing clinical risk models that simply omit race and ethnicity may result in worse prediction accuracy for racial and ethnic minority groups, which may lead to inappropriate care recommendations that ultimately contribute to health disparities.


eFigure 1. Inclusion and Exclusion Criteria
Description: We excluded patients who were diagnosed with appendix cancer, had a previous cancer diagnosis within one year prior to their CRC diagnosis, had no KPSC membership within 90 days of CRC diagnosis, or had an unusually long adjuvant treatment duration (i.e., > 1 year). The start of cancer surveillance was defined as 90 days after the end of primary surgery or adjuvant treatment. We excluded patients who, prior to their surveillance start date, died, had a second cancer diagnosis, had a CRC recurrence, initiated hospice, or ended their membership. To avoid misclassification of CRC recurrence, we further excluded patients who received chemotherapy associated with metastatic cancer (i.e., capecitabine, oxaliplatin, 5-fluorouracil, or irinotecan) within 180 days of their cancer resection but had no other indicator of recurrence, and those who received radiation associated with metastatic cancer but without any chemotherapy within 180 days of their cancer resection. We also excluded those with inconsistent N stage and number of positive lymph node values, unknown T-stage, an unknown number of nodes examined, or a non-zero number of nodes examined but an unknown number of positive nodes. Finally, we excluded individuals in the "multiracial or other" racial/ethnic group due to small sample size.
Handling of missing data: No outcome status was missing because we relied on a validated algorithm to identify recurrence outcomes from healthcare utilization patterns (see eTable1). Patients with missing predictor information (unknown T-stage, unknown number of nodes examined, or a non-zero number of nodes examined but an unknown number of positive nodes) were excluded as shown in the diagram above. Unknown perineural invasion status was captured using an indicator variable.

eAppendix 1. Approach to Ascertaining the Model Outcome
Patients were considered to have had a recurrence if they had any of the following:
1) A prescription for any of the following adjuvant CRC drugs (fluorouracil, oxaliplatin, capecitabine) more than 90 days after the end of adjuvant therapy;
2) A prescription for any of the metastatic CRC drugs (irinotecan, cetuximab, panitumumab, bevacizumab, aflibercept, ziv-aflibercept, regorafenib, trifluridine, ramucirumab, nivolumab, pembrolizumab) at any time;
3) A prescription for any anti-cancer therapy associated with a metastatic ICD diagnosis code (ICD9: 197, 198, ICD10: C78, C79) at any time;
4) Radiation therapy received more than 90 days after the end of adjuvant therapy;
5) A primary CRC surgery procedure ≥ 225 days (7.5 months) after the KPSC Cancer Registry surgery date;
6) A metastatic surgery procedure;
7) Any imaging associated with a metastatic diagnosis, defined as imaging impression text in the radiologist's exam summary that mentioned potential recurrence or evidence of metastatic disease, together with at least one occurrence of a metastatic cancer diagnosis code (ICD9: 197, 198, ICD10: C78, C79) within 30 days of the imaging date in the patient's history or encounter records; or
8) A hospice referral with a metastatic ICD diagnosis code.
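The time-based criteria above can be sketched in code. The following is a minimal illustration only, not the validated algorithm itself: it covers criteria 1, 2, 4, and 5, and the `Event` record, its field names, and the function `has_recurrence_signal` are hypothetical structures introduced for this example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, Optional

ADJUVANT_DRUGS = {"fluorouracil", "oxaliplatin", "capecitabine"}
METASTATIC_DRUGS = {
    "irinotecan", "cetuximab", "panitumumab", "bevacizumab", "aflibercept",
    "ziv-aflibercept", "regorafenib", "trifluridine", "ramucirumab",
    "nivolumab", "pembrolizumab",
}


@dataclass
class Event:
    """Hypothetical utilization record; field names are illustrative."""
    kind: str                  # "drug", "radiation", or "crc_surgery"
    day: date
    drug: Optional[str] = None


def has_recurrence_signal(events: Iterable[Event],
                          adjuvant_end: date,
                          registry_surgery: date) -> bool:
    """Return True if any of criteria 1, 2, 4, or 5 is met."""
    for e in events:
        days_after_adjuvant = (e.day - adjuvant_end).days
        # Criterion 2: metastatic CRC drug at any time.
        if e.kind == "drug" and e.drug in METASTATIC_DRUGS:
            return True
        # Criterion 1: adjuvant drug more than 90 days after adjuvant therapy ended.
        if e.kind == "drug" and e.drug in ADJUVANT_DRUGS and days_after_adjuvant > 90:
            return True
        # Criterion 4: radiation more than 90 days after adjuvant therapy ended.
        if e.kind == "radiation" and days_after_adjuvant > 90:
            return True
        # Criterion 5: primary CRC surgery >= 225 days after the registry surgery date.
        if e.kind == "crc_surgery" and (e.day - registry_surgery).days >= 225:
            return True
    return False


surgery = date(2015, 1, 10)
adjuvant_end = date(2015, 7, 1)
late_chemo = [Event("drug", adjuvant_end + timedelta(days=120), "capecitabine")]
early_chemo = [Event("drug", adjuvant_end + timedelta(days=30), "capecitabine")]
print(has_recurrence_signal(late_chemo, adjuvant_end, surgery))
print(has_recurrence_signal(early_chemo, adjuvant_end, surgery))
```

Criteria 3, 6, 7, and 8 depend on diagnosis codes, procedure codes, and radiology text, and are left out of this sketch.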
A detailed chart review was performed in a random sample of 315 individuals to validate the recurrence outcome captured using this algorithm. Overall accuracy of the utilization-based recurrence outcome was high (positive predictive value 90%; negative predictive value 97%) and comparable to that found in other studies. 1,2

eAppendix 2. Model Development Details
We applied four prediction modeling strategies that differed in how they handled the race/ethnicity variable. All models used Cox proportional hazards regression with time from the start of surveillance to recurrence as the outcome, with KPSC membership end, hospice initiation, second non-CRC primary cancer diagnosis, and end of study before recurrence treated as censoring events. Death before recurrence, a competing event, was infrequent (10%). We compared the risk estimates from the Cox model to those obtained using a competing risk regression (Fine and Gray) and saw minimal impact on estimates due to the relatively small proportion of patients who died before recurrence. Death was therefore treated as a censored observation to simplify the analysis.
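To illustrate why treating death as censoring changes little when deaths before recurrence are uncommon, the sketch below compares two nonparametric risk estimates on toy data: the complement of the Kaplan-Meier estimate (death censored) and the Aalen-Johansen cumulative incidence (death as a competing event). This is not the Cox or Fine and Gray regression described above; the function names, the status coding (0 = censored, 1 = recurrence, 2 = death before recurrence), and the data are assumptions made for illustration.

```python
def one_minus_km(data, horizon):
    """Recurrence risk by `horizon`, treating death (status 2) as censoring."""
    surv = 1.0
    at_risk = len(data)
    for t in sorted({time for time, _ in data}):
        if t > horizon:
            break
        recurrences = sum(1 for time, s in data if time == t and s == 1)
        leaving = sum(1 for time, _ in data if time == t)
        surv *= 1 - recurrences / at_risk
        at_risk -= leaving
    return 1 - surv


def cumulative_incidence(data, horizon):
    """Recurrence risk by `horizon`, with death as a competing event."""
    event_free = 1.0   # probability of neither recurrence nor death so far
    cif = 0.0
    at_risk = len(data)
    for t in sorted({time for time, _ in data}):
        if t > horizon:
            break
        recurrences = sum(1 for time, s in data if time == t and s == 1)
        deaths = sum(1 for time, s in data if time == t and s == 2)
        leaving = sum(1 for time, _ in data if time == t)
        cif += event_free * recurrences / at_risk
        event_free *= 1 - (recurrences + deaths) / at_risk
        at_risk -= leaving
    return cif


# Toy follow-up data: (time, status)
toy = [(1, 1), (2, 2), (3, 1), (4, 0), (5, 1), (6, 0)]
print(one_minus_km(toy, 6), cumulative_incidence(toy, 6))
```

The censoring-based estimate is never smaller than the cumulative incidence, and the two coincide when no deaths occur before recurrence, so the gap shrinks as the competing event becomes rarer.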
For all models, we included variables previously shown to be predictive of cancer recurrence. 3 The variables included were age, sex (male, female), cancer stage (AJCC v7), tumor histology, number of lymph nodes examined, positive node ratio (PNR), pathologic T-stage, tumor site (colon vs. rectum), adjuvant chemotherapy received, perineural invasion, and the interaction terms stage*adjuvant chemotherapy and stage*age. All covariates, except for adjuvant chemotherapy received, were measured at the time of diagnosis. All tumor information was obtained from the KPSC SEER-affiliated cancer registry. Tumor histology was defined using ICD-O-3 histology codes: non-mucinous adenocarcinoma (codes "8140", "8144", "8210", "8211", "8221", "8255", "8260", "8261", "8262", "8263", or "8574") and mucinous neoplasms (codes "8480" and "8481"). The number of regional nodes found positive for cancer at pathological examination and the number of regional lymph nodes pathologically examined were obtained from the SEER Extent of Disease records. PNR was defined as the ratio of the number of positive lymph nodes to the total number of lymph nodes examined, which was calculated for patients with more than 12 nodes examined. Pathologic T-stage referred to T-stage per AJCC v6. Tumor site was identified using ICD-O-3 site codes: colon (codes: C180, C182-189) and rectum (codes: C199, C209). The Collaborative Staging Site-Specific Factor 8 was used to identify perineural invasion status, which was dichotomized as yes (perineural invasion present) vs. no (perineural invasion not present). All treatment information was extracted from the pharmacy database and electronic medical records. Receipt of adjuvant chemotherapy (Yes/No) was defined as the initiation of capecitabine or fluorouracil within 90 days of surgery or radiation therapy (if received after surgery).
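The covariate definitions above can be sketched as simple lookups. The function names and return labels below are hypothetical. Returning `None` for PNR when 12 or fewer nodes were examined is an assumption, since the text does not state how those patients were handled.

```python
NON_MUCINOUS_CODES = {"8140", "8144", "8210", "8211", "8221", "8255",
                      "8260", "8261", "8262", "8263", "8574"}
MUCINOUS_CODES = {"8480", "8481"}
COLON_CODES = {"C180"} | {f"C18{d}" for d in range(2, 10)}   # C180, C182-C189
RECTUM_CODES = {"C199", "C209"}


def histology_group(icdo3_histology):
    """Map an ICD-O-3 histology code to the two groups named in the text."""
    if icdo3_histology in NON_MUCINOUS_CODES:
        return "non-mucinous adenocarcinoma"
    if icdo3_histology in MUCINOUS_CODES:
        return "mucinous"
    return None   # code not listed in the text


def tumor_site(icdo3_site):
    """Map an ICD-O-3 site code to colon or rectum."""
    if icdo3_site in COLON_CODES:
        return "colon"
    if icdo3_site in RECTUM_CODES:
        return "rectum"
    return None   # e.g. C181 (appendix), which the study excluded


def positive_node_ratio(positive, examined):
    """PNR = positive nodes / nodes examined, for more than 12 nodes examined."""
    if positive > examined:
        raise ValueError("positive nodes cannot exceed nodes examined")
    if examined <= 12:
        return None   # assumption: PNR left undefined in this case
    return positive / examined


print(histology_group("8480"), tumor_site("C187"), positive_node_ratio(3, 15))
```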
Race/ethnicity information was obtained from membership files, utilization data, preferred language, and birth certificates. 4 Self-reported race/ethnicity and official documents were given preference over other sources. Race/ethnicity categories included Non-Hispanic White, Hispanic, Black/African American, Asian/Hawaiian/Pacific Islander, and Multiracial or Other. There was no unknown or missing race/ethnicity. The "Multiracial or Other" subgroup was excluded from the analyses due to small sample size.

eTable. Statistical Criteria for Algorithmic Fairness

Equal Calibration within Groups 5, 6
Description:
- For each possible predicted risk score, the proportion of patients experiencing a recurrence should be the same across racial/ethnic subgroups and equal to that risk score.
- Motivated by the idea that fairness requires a given risk score to have the same evidential value regardless of racial/ethnic group. 7
How it was calculated: For each racial/ethnic group, we plotted the observed Kaplan-Meier risks vs. the predicted recurrence risks across deciles of predicted risks. The predicted and expected risks were estimated using predictSurvProb and calPlot (from the pec package). Calibration was assessed by the calibration intercept and slope. The intercept assesses calibration-in-the-large (or mean calibration), 8 with negative values suggesting overestimation and positive values suggesting underestimation. A slope <1 suggests that the estimated risks are too high for those with high risk and too low for patients at low risk. Slope >1 suggests that the risk estimates are too moderate.

Equal Discriminative Ability 9
Description:
- Motivated by the thought that a fair model should be able to correctly rank order individuals equally well between racial/ethnic groups.
How it was calculated: Area under the receiver operating characteristic curve (AUC), which measures how well each model distinguished between those with and without recurrence in each racial/ethnic group. Values range from 0 to 1. A value of 0.5 suggests that the model performs no better than chance; 0.7 to 0.8 is considered acceptable; > 0.8 is considered excellent. 10

Equal False-Positive and False-Negative Rates 6, 11, 12
Description:
- Among those who truly are without a recurrence, the proportion falsely predicted to be positive (false-positive rate; FPR) should be the same across racial/ethnic groups.
- Similarly, among those who truly had a recurrence, the proportion falsely predicted to not have a recurrence (false-negative rate; FNR) should be the same across racial/ethnic groups.
- These two fairness criteria, sometimes referred to as "equalized odds", require that individuals from different groups with similar actual risk be treated the same by the algorithm. 11
How it was calculated: We evaluated the FNR and FPR at a 5% risk threshold, reflecting a hypothetical clinical scenario where intensive surveillance may be recommended for patients whose risks of recurrence within 3 years exceed 5%. Note that a lower risk cutoff (i.e., recommending more intensive surveillance for a larger proportion of patients) may be of interest for clinical scenarios where sensitivity of the algorithm is critical, that is, where the harms of missing a recurrence far outweigh the harms of an unnecessary test. A higher threshold, in contrast, weighs the relative harm of a false positive higher.

Equal Positive Predictive Value and Negative Predictive Value 6
Description:
- Among those who were predicted to have a recurrence (defined by risk above a pre-defined threshold), the proportion who actually experienced a recurrence (positive predictive value; PPV) should be the same across racial/ethnic groups.
- Similarly, among those who were predicted to be recurrence-negative (defined by risk below or equal to a pre-defined threshold), the proportion who were actually recurrence-negative (negative predictive value; NPV) should be the same across racial/ethnic groups.
- These two criteria are similar to criterion 1 (equal calibration within groups) in that they are motivated by the idea that fairness requires a positive or negative prediction to have the same evidential value across all groups.
How it was calculated: We evaluated the PPV and NPV at a 5% risk threshold.
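The threshold-based criteria and the AUC can be computed per group as in the minimal sketch below. It assumes each patient has a predicted risk and a known binary recurrence status at the evaluation horizon, and it ignores censoring, so it simplifies whatever estimators the authors actually used. The function names and toy records are hypothetical.

```python
from collections import defaultdict


def threshold_metrics(records, threshold=0.05):
    """records: iterable of (group, predicted_risk, had_recurrence)."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, risk, recurred in records:
        flagged = risk > threshold   # positive if risk is above the threshold
        if flagged:
            key = "tp" if recurred else "fp"
        else:
            key = "fn" if recurred else "tn"
        counts[group][key] += 1

    def ratio(num, den):
        return num / den if den else None

    return {
        g: {
            "FPR": ratio(c["fp"], c["fp"] + c["tn"]),
            "FNR": ratio(c["fn"], c["fn"] + c["tp"]),
            "PPV": ratio(c["tp"], c["tp"] + c["fp"]),
            "NPV": ratio(c["tn"], c["tn"] + c["fn"]),
        }
        for g, c in counts.items()
    }


def auc(pairs):
    """pairs: list of (predicted_risk, had_recurrence); ties count as 0.5."""
    pos = [r for r, y in pairs if y]
    neg = [r for r, y in pairs if not y]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


toy_records = [
    ("A", 0.10, True), ("A", 0.02, False), ("A", 0.08, False), ("A", 0.03, True),
    ("B", 0.20, True), ("B", 0.01, False),
]
metrics = threshold_metrics(toy_records)
auc_a = auc([(r, y) for g, r, y in toy_records if g == "A"])
print(metrics, auc_a)
```

Comparing these values between groups (here the toy groups "A" and "B") is what each fairness criterion asks for: equal FPR and FNR, equal PPV and NPV, and equal AUC.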

eFigure 2. Comparison of Calibration Across Racial and Ethnic Groups in Each Model
The intercept assesses calibration-in-the-large (or mean calibration), with negative values suggesting overestimation and positive values suggesting underestimation. A slope <1 suggests that the estimated risks are too high for those with high risk and too low for patients at low risk. Slope >1 suggests that the risk estimates are too moderate. Values in brackets show the 95% confidence intervals obtained through 1000 bootstraps.
a Indicates that the 95% CI of the slope does not include 1 or that the 95% CI of the intercept does not include 0.
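A percentile bootstrap of the kind used for these confidence intervals can be sketched as follows. The statistic here, mean observed outcome minus mean predicted risk, is a crude stand-in for the calibration intercept chosen so the example stays self-contained; it is not the authors' estimator. The function names, toy data, and seed are assumptions.

```python
import random


def mean_calibration_gap(sample):
    """Mean observed outcome minus mean predicted risk.

    Positive values suggest underestimation, negative values overestimation.
    """
    n = len(sample)
    return sum(y for _, y in sample) / n - sum(p for p, _ in sample) / n


def bootstrap_ci(sample, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `statistic`."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = sorted(
        statistic([sample[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower_idx = int(n_boot * alpha / 2)
    upper_idx = n_boot - 1 - lower_idx
    return estimates[lower_idx], estimates[upper_idx]


def excludes(ci, null_value):
    """True if the interval does not contain the null value (footnote a)."""
    lower, upper = ci
    return null_value < lower or null_value > upper


# Toy (predicted_risk, observed_outcome) pairs
toy_sample = [(0.05, 0), (0.10, 1), (0.20, 0), (0.30, 1), (0.15, 0), (0.40, 1)]
ci = bootstrap_ci(toy_sample, mean_calibration_gap)
print(ci, excludes(ci, 0.0))
```

For the slope the null value would be 1 rather than 0, matching the footnote.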